Include metadata in numpy/polars cache fingerprints to prevent collisions by skrawcz · Pull Request #1616 · apache/hamilton

skrawcz · 2026-05-29T22:19:46Z

Summary

Fix hash collisions in the caching subsystem's fingerprinting for numpy arrays and polars DataFrames.

Problem

hash_numpy_array used only obj.tobytes(), which discards shape and dtype. Arrays with identical raw bytes but different shapes (e.g., shape=(6,) vs shape=(2,3)) or different dtypes (e.g., float32(1.0) vs int32(1065353216)) produced identical cache keys.
hash_polars_dataframe used only obj.hash_rows(), which discards column names. DataFrames with identical cell values but different schemas produced identical cache keys.

Both could cause the cache to silently return incorrect results from a previous execution.
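The numpy collision described above can be reproduced in a few lines. This is a minimal sketch, not Hamilton's actual code: `naive_fingerprint` is a hypothetical stand-in for the old behavior, and the use of `hashlib.sha224` is an assumption made only for illustration.

```python
import hashlib

import numpy as np


def naive_fingerprint(arr: np.ndarray) -> str:
    """Hypothetical stand-in for the old behavior: hash only the raw bytes."""
    return hashlib.sha224(arr.tobytes()).hexdigest()


# Same bytes, different shapes: a flat vector and a 2x3 matrix.
flat = np.arange(6, dtype=np.int64)
grid = flat.reshape(2, 3)
assert naive_fingerprint(flat) == naive_fingerprint(grid)

# Same bit pattern, different dtypes: 0x3F800000 is 1.0 as float32.
as_float = np.array([1.0], dtype=np.float32)
as_int = np.array([1065353216], dtype=np.int32)
assert naive_fingerprint(as_float) == naive_fingerprint(as_int)
```

Both assertions pass, which is exactly the problem: the two arrays in each pair are semantically different but would share a cache key.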

Fix

hash_numpy_array: prepend f"{obj.shape}:{obj.dtype}" to the bytes before hashing
hash_polars_dataframe: include column names and dtypes (schema) alongside row hashes
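The numpy half of the fix could look roughly like the sketch below. The header format `f"{obj.shape}:{obj.dtype}"` comes from the description above; the function name `fingerprint_with_metadata` and the choice of `hashlib.sha224` are assumptions for illustration and may not match the merged code.

```python
import hashlib

import numpy as np


def fingerprint_with_metadata(arr: np.ndarray) -> str:
    """Hash shape and dtype together with the raw bytes (illustrative only)."""
    # Prepend a textual header so arrays with equal bytes but a different
    # shape or dtype feed different input to the hash function.
    header = f"{arr.shape}:{arr.dtype}".encode()
    return hashlib.sha224(header + arr.tobytes()).hexdigest()


flat = np.arange(6, dtype=np.int64)
grid = flat.reshape(2, 3)
as_float = np.array([1.0], dtype=np.float32)
as_int = np.array([1065353216], dtype=np.int32)

# Different shapes or dtypes now yield different fingerprints.
assert fingerprint_with_metadata(flat) != fingerprint_with_metadata(grid)
assert fingerprint_with_metadata(as_float) != fingerprint_with_metadata(as_int)

# Identical arrays still yield identical fingerprints.
assert fingerprint_with_metadata(flat) == fingerprint_with_metadata(flat.copy())
```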

Backwards compatibility

This changes the hash output for numpy arrays and polars DataFrames. Existing cache entries will simply miss (a different hash triggers recomputation) rather than return incorrect results. Users will see a one-time recomputation after upgrading, but no manual cache clearing is needed.

Tests

Added tests verifying:

Different shapes produce different hashes
Different dtypes with same bit pattern produce different hashes
Different column names produce different hashes
Identical data still produces identical hashes

Reported-by: Dem0

…ions

hash_numpy_array now includes shape and dtype in the hash, preventing collisions between arrays with identical raw bytes but different semantics (e.g., shape=(6,) vs shape=(2,3)).

hash_polars_dataframe now includes column names and dtypes (schema) in the hash, preventing collisions between DataFrames with identical cell values but different column schemas.

Existing caches will simply miss (different hash = recomputation), not produce incorrect results.

Reported-by: Dem0

elijahbenizzy

And closing out my duplicate

Dev-iL

If we're redoing hashes - perhaps it might be worth switching from sha224/md5 to xxh3?

ArnavBalyan

Lgtm ty for the change

skrawcz · 2026-06-01T06:09:44Z

> If we're redoing hashes - perhaps it might be worth switching from sha224/md5 to xxh3?

Separate PR? We would have to validate the license, etc.

elijahbenizzy approved these changes May 30, 2026


elijahbenizzy mentioned this pull request May 30, 2026

fix(caching): include shape/dtype/schema in fingerprint hashing #1617

Closed

Dev-iL approved these changes May 31, 2026


ArnavBalyan approved these changes May 31, 2026


skrawcz merged commit 7644a30 into main Jun 1, 2026
6 checks passed

skrawcz deleted the stefan/fix-cache-fingerprinting-collisions branch June 1, 2026 06:09

This was referenced Jun 1, 2026

Speed up and harden cache fingerprinting: xxh3_128 + vectorized DataFrame hashing + collision fixes #1619

Closed

Vectorize pandas and polars DataFrame hashing #1629

Merged

skrawcz merged 1 commit into main from stefan/fix-cache-fingerprinting-collisions
